3.2 Basic NLP and Text Checks

Before the main natural language processing analysis, we ran a series of basic text checks on the dataset.

3.2.1 Distribution of Text Length

Using a user-defined function that computes the length of each document in words, we analyzed the distribution of text length across submissions and comments. On average, posts related to dogecoin contain approximately 110 words, comments 11.9 words, and titles 9.3 words.

Table 1. Summary statistics of post, comment, and title lengths (in words)

Text       Average   Maximum   Minimum
Posts      109.98    5532      1
Comments   11.9      1404      1
Titles     9.3       80        1
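The summary statistics in Table 1 could be computed along the following lines. This is a minimal sketch in plain Python, assuming that length means the number of whitespace-separated tokens; the function names `word_count` and `length_summary` and the sample strings are hypothetical and are not taken from our pipeline.

```python
from statistics import mean


def word_count(text):
    """Return the number of whitespace-separated tokens in a document."""
    return len(text.split())


def length_summary(documents):
    """Return the average, maximum, and minimum word count of the documents."""
    lengths = [word_count(doc) for doc in documents]
    return {
        "average": round(mean(lengths), 2),
        "maximum": max(lengths),
        "minimum": min(lengths),
    }


# Hypothetical sample; the real dataset is not reproduced here.
sample_comments = ["to the moon", "hodl", "doge is the people's coin"]
summary = length_summary(sample_comments)
print(summary)
```

Applying the same function separately to post bodies, comments, and titles would yield one row of the table each.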

The following histogram plots the distribution of text lengths in posts and comments, colored by the subreddit in which each was posted. It shows a striking contrast: very short posts are concentrated in r/dogecoin, while r/CryptoCurrency posts fall toward the longer end of the distribution. This might indicate that r/CryptoCurrency posts carry more information, or are of higher quality, than those in r/dogecoin.


Figure 1. Histogram of post lengths for both subreddits
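The counts behind a histogram of this kind can be sketched as follows. This is an illustration under stated assumptions rather than the code used for Figure 1: the bin width of 50 words, the function name `binned_lengths`, and the sample posts are all hypothetical.

```python
from collections import Counter


def binned_lengths(posts, bin_width=50):
    """Count posts per (subreddit, length bin) pair.

    posts is an iterable of (subreddit, text) tuples. Each bin is
    identified by its lower edge in words, e.g. 0, 50, 100, ...
    """
    counts = Counter()
    for subreddit, text in posts:
        n_words = len(text.split())
        bin_start = (n_words // bin_width) * bin_width
        counts[(subreddit, bin_start)] += 1
    return counts


# Hypothetical sample posts, for illustration only.
sample_posts = [
    ("dogecoin", "much wow"),
    ("dogecoin", "to the moon"),
    ("CryptoCurrency", "word " * 120),
]
hist = binned_lengths(sample_posts)
print(hist)
```

Plotting the counts for each subreddit in a different color over the same bins gives the overlaid view described above.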

3.2.2 Frequent Words

By splitting the cleaned output of our pipeline into individual words, we counted the most frequently occurring words in both submissions and comments.

Figure 2. Top 10 most frequently used words

Note: Based on a sample.
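The word count described in this subsection can be sketched as below. It assumes that the cleaned text is already lowercased and stripped of punctuation and stop words, so that a whitespace split is sufficient; the function name `top_words` and the sample documents are hypothetical.

```python
from collections import Counter


def top_words(cleaned_docs, n=10):
    """Return the n most frequent words across the cleaned documents."""
    counts = Counter()
    for doc in cleaned_docs:
        # Split each document into words and add them to the running tally.
        counts.update(doc.split())
    return counts.most_common(n)


# Hypothetical cleaned documents, for illustration only.
sample_clean = ["doge moon buy", "buy doge", "doge hold"]
top = top_words(sample_clean, n=2)
print(top)
```

Running the function once on submissions and once on comments would give the two rankings compared in Figure 2.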

3.2.3 URLs

Using a regex-based search for URLs, we found that a higher percentage of posts and comments in r/CryptoCurrency contain URLs than in r/dogecoin. This is consistent with, but does not confirm, our hypothesis that posts in r/dogecoin contain lower-quality information, or fewer citations and external links to support their claims.

Figure 3. Bar graph of the share of posts containing URLs
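A regex-based URL check of the kind described in this subsection might look like the following sketch. The pattern is an assumption (it matches strings starting with `http://`, `https://`, or `www.`) and is not necessarily the expression used in our analysis; the function name `url_share` and the sample texts are hypothetical, and the sample values do not reproduce our results.

```python
import re

# Assumed pattern: a scheme or "www." followed by non-whitespace characters.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)


def url_share(texts):
    """Return the fraction of texts that contain at least one URL."""
    texts = list(texts)
    if not texts:
        return 0.0
    with_url = sum(1 for text in texts if URL_PATTERN.search(text))
    return with_url / len(texts)


# Hypothetical sample texts per subreddit, for illustration only.
sample_by_subreddit = {
    "dogecoin": ["to the moon", "see www.example.com", "hodl", "wow"],
    "CryptoCurrency": ["source: https://example.org/report", "thoughts?"],
}
shares = {name: url_share(texts) for name, texts in sample_by_subreddit.items()}
print(shares)
```

A simple pattern like this one misses bare domains such as `example.com`, so the resulting shares are best read as a lower bound.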